If we look back at the last 15-20 years, Cybersecurity has become a very important topic across every person in the world regardless of their age and every business. This point is illustrated by two growth factor- Market Capital and Jobs. The capital market has more than doubled in spend in the US since 2010 and it will triple from 2019 until 2022 reaching to a total market capital of 180 billion. Additionally, because of the growth, there is a need for cybersecurity professionals. The job market is exrtonaray hot because the number of jobs has doubled since 2010 and there will be a total of 6 million jobs in 2019 with a 2 million gap in open positions, or what the industry calls the talent shortage. This appears to reflect a healthy growth market and the cybersecurity industry is booming.
| Spend in Cybersecurity | Demand for Security Professionals | Breaches Over Years |
|---|---|---|
| Source: Statista 2019 | Source: Frost and Sullivan | Source: Identity Threat Resource Center |
However, the reality of cybersecurity is that since 2005, the world has seen a growth in the number of annual breaches in companies by 9x. We have seen OPM, JP Morgan, Anthem, Sony, Equifax, Home Depot, Adult Friend Finder, Yahoo!, Target and recently Marriott was just breached resulting in a compromise of 500 million records. While the world has spent hundreds of Billions of dollars and hired over 6 million cybersecurity professionals, the attackers are winning and have been winning as long as the Internet has been around. The cybersecurity attackers are very organized, well-funded, and in some cases, nation state backed. Unlike many industries, attackers or criminals are making millions of dollars attacking companies, like CyptoWall, which made a reported amount of 325 million dollars.
One answer to address the attacker problem is to integrate Data Science into cybersecurity in order to predict the outcomes of attackers utilizing rich data sets. This is a new capability that cybersecurity is starting to use and could hold the key for the future. Many industry pundits believe that data science, machine learning, deep learning, and Artificial intelligence will be the keys to successfully to stop attackers. Over the last few years, companies like McAfee, Symantec, Cylance, and Crowd Strike have developed Machine Learning Behavioral Products, but these are still not 100% effective because the attackers are writing malware that can bypass these algorithms or feature sets. In order to secure the future, the cybersecurity industry must evolve and develop more advanced algorithms to prevent and stop attackers.
In order to help solve this problem, Team 4 is going to predict the probability of whether a Windows machines could get infected by various families of malware. This dataset that will be used is part of a current Microsoft Kaggle competition and was captured by Microsoft in order to help them predict if Kagglers could determine whether machines would be breached. Within the dataset, telemetry containing various properties and the machine infections was generated by combining heartbeat and threat reports collected by Microsoft's endpoint protection solution, Windows Defender.
For this lab, Team 4 will predict if a machine is vulnerable to a malware attack. The response variable HasDetections will be used in order to train and test the dataset. The dataset has more than 300,000 rows and 83 columns. The team will identify key features and joint attributes that determine the classification of whether the machine is susceptible to compromise. Multiple classification algorithms will be used to train the models, for example Logistic Rregression, SVM, and Random Forest in attempt to provide the best accuracy possible. In order to measure the performance of the model, Accuracy, Sensitivity, and Log Loss function are calculated. For model training, a 3-folds Cross Validation will be performed to prevent overfitting.
%matplotlib inline
from IPython.display import display
import numpy as np
import pandas as pd
# import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
#Show all columns
pd.options.display.max_columns = None
pd.options.display.max_rows = 100
#%time train = pd.read_csv("data/train.csv")
#%time final = pd.read_csv("data/final.csv")
%time final = pd.read_csv("data/clean.final.csv")
Each row in this dataset corresponds to a machine, uniquely identified by a MachineIdentifier. Microsoft sampled the data from their telemetry so they could enter into the Kaggle competetion.
The sampling methodology used to create this dataset was designed to meet certain business constraints, both in regards to user privacy as well as the time period during which the machine was running. Malware detection is inherently a time-series problem, but it is made complicated by the introduction of new machines, machines that come online and offline, machines that receive patches, machines that receive new operating systems, etc. While the dataset provided here has been roughly split by time, the complications and sampling requirements mentioned above may mean you may see imperfect agreement between your cross validation, public, and private scores! Additionally, this dataset is not representative of Microsoft customers’ machines in the wild; it has been sampled to include a much larger proportion of malware machines. -Microsoft Kaggle competition
For the purpose of our academic research, we elected to select a single CountryIdentifier. Through our research, we reviewed the percentage breakdown by CountryIdentifier. The largest grouping of Countries share a similar makeup of HasDetections, as a result we have decided to select CountryIdentifier 141, which includes 333k rows.
countries = final.pivot_table(index='CountryIdentifier', columns='HasDetections',
values=['MachineIdentifier'],aggfunc=len, margins=True)
countries = countries.reindex(countries['MachineIdentifier'].sort_values(by='All', ascending=False).index)
# display(countries.MachineIdentifier.head(10))
countries_rate = countries.MachineIdentifier[[0,1]].div(countries.MachineIdentifier[[0,1]].sum(1).astype(float),
axis=0)
countries_rate.head(10).plot(kind='barh',
stacked=True) #figsize=(4,8))
Data Field |
Description of Data Field |
Data Type |
Data Quality |
|
|---|---|---|---|---|
HasDetections |
Ground truth and indicates that Malware was detected on the machine and will be the key attribute used to predict whether the machine has been compromised. |
Response Variable (1/0) |
||
ProductName |
Defender state information e.g. win8defender |
Categorical |
Removed 96% of its values are the same |
|
EngineVersion |
Defender state information e.g. 1.1.12603.0 |
Categorical |
||
AppVersion |
Defender state information e.g. 4.9.10586.0 |
Categorical |
||
AvSigVersion |
Defender state information e.g. 1.217.1014.0 |
|||
IsBeta |
Defender state information e.g. false |
Categorical |
Removed 100% of its values are the same. |
|
RtpStateBitfield |
Is Real Time Transport Protocol Enabled |
Categorical |
Replacing any null values with -1, indicating unknown |
|
IsSxsPassiveMode |
Is Side by Side Assempblies in Passive Mode |
Boolean |
Removed 99% of its values are the same. |
|
DefaultBrowsersIdentifier |
ID for the machine's default browser |
Categorical |
Removed 93% of its values missing are the same |
|
AVProductStatesIdentifier |
ID for the specific configuration of a user's antivirus software |
Categorical |
Replace null values with field's mode 53447 |
|
AVProductsInstalled |
Whether AV Products are installed |
Categorical |
Replace null values with 1, the highest mode |
|
AVProductsEnabled |
<p align=left"> Whether or not the AV is enabled |
Categorical |
Replaced null values with 0, indicating not enabled | |
HasTpm |
True if machine has tpm |
Boolean |
Removed 99% of its values are the same. |
|
CountryIdentifier |
ID for the country the machine is located in |
Categorical |
Filtered by 141 |
|
CityIdentifier |
ID for the city the machine is located in |
Categorical |
Replace null values with field's mode 92213 |
|
OrganizationIdentifier |
ID for the organization the machine belongs in, organization ID is mapped to both specific companies and -broad industries |
Categorical |
Replace any null values with 0, indicating no organization |
|
GeoNameIdentifier |
ID for the geographic region a machine is located in |
Categorical |
Replace null values with the field's mode 167 |
|
LocaleEnglishNameIdentifier |
English name of Locale ID of the current user |
Categorical |
||
Platform |
Calculates platform name (of OS related properties and processor property) |
Categorical |
||
Processor |
This is the process architecture of the installed operating system |
Categorical |
||
OsVer |
Version of the current operating system |
Categorical |
||
OsBuild |
Build of the current operating system |
Categorical |
||
OsSuite |
Product suite mask for the current operating system. |
Categorical |
||
OsPlatformSubRelease |
Returns the OS Platform sub-release (Windows Vista, Windows 7, Windows 8, TH1, TH2) |
Categorical |
||
OsBuildLab |
Build lab that generated the current OS. Example: 9600.17630.amd64fre.winblue_r7.150109-2022 |
Categorical |
Replace null value with the field's model that is aligned by OSBuild and Processor |
|
SkuEdition |
The goal of this feature is to use the Product Type defined in the MSDN to map to a 'SKU-Edition' name that is useful in population reporting. The valid Product Type are defined in %sdxroot%\data\windowseditions.xml. This API has been used since Vista and Server 2008, so there are many Product Types that do not apply to Windows 10. The 'SKU-Edition' is a string value that is in one of three classes of results. The design must hand each class. |
Categorical |
||
IsProtected |
This is a calculated field derived from the Spynet Report's AV Products field. Returns: a. TRUE if there is at least one active and up-to-date antivirus product running on this machine. b. FALSE if there is no active AV product on this machine, or if the AV is active, but is not receiving the latest updates. c. null if there are no Anti Virus Products in the report. Returns: Whether a machine is protected. |
Boolean |
||
AutoSampleOptIn |
This is the SubmitSamplesConsent value passed in from the service, available on CAMP 9+ |
Categorical |
Removed 99.99% of its values are the same. |
|
PuaMode |
Pua Enabled mode from the service |
Boolean |
Removed 99.99% of its values are missing |
|
SMode |
This field is set to true when the device is known to be in 'S Mode', as in, Windows 10 S mode, where only Microsoft Store apps can be installed |
Boolean |
Removed 93% of its values are the same |
|
IeVerIdentifier |
Determination of the last known state of the operating system |
Categorical |
Replace null values with -1, indicating unknown | |
SmartScreen |
This is the SmartScreen enabled string value from registry. This is obtained by checking in order, HKLM\SOFTWARE\Policies\Microsoft\Windows\System\SmartScreenEnabled and HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\SmartScreenEnabled. If the value exists but is blank, the value "ExistsNotSet" is sent in telemetry. |
Categorical |
Replaced all missings values with ExistsNotSet along with any special character values. |
|
Firewall |
This attribute is true (1) for Windows 8.1 and above if windows firewall is enabled, as reported by the service. |
Boolean |
||
UacLuaenable |
This attribute reports whether or not the "administrator in Admin Approval Mode" user type is disabled or enabled in UAC. The value reported is obtained by reading the regkey HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System\EnableLUA. |
Boolean |
Removed 99% of its values are the same |
|
Census_MDC2FormFactor |
A grouping based on a combination of Device Census level hardware characteristics. The logic used to define Form Factor is rooted in business and industry standards and aligns with how people think about their device. (Examples: Smartphone, Small Tablet, All in One, Convertible...) |
Categorical |
||
Census_DeviceFamily |
AKA DeviceClass. Indicates the type of device that an edition of the OS is intended for. Example values: Windows.Desktop, Windows.Mobile, and iOS.Phone |
Categorical |
Removed 99% of its values are the same |
|
Census_OEMNameIdentifier |
OEM Identifier COde |
Categorical |
Replace null values with Census_MDC2FormFactor's mode |
|
Census_OEMModelIdentifier |
OEM Model Identifier |
Categorical |
Replace null values with Census_MDC2FormFactor's mode |
|
Census_ProcessorCoreCount |
Number of logical cores in the processor |
Continous |
Replace null values with Census_MDC2FormFactor's median |
|
Census_ProcessorManufacturerIdentifier |
Processor Manufacturer |
Categorical |
Replace null values with Census_MDC2FormFactor's mode |
|
Census_ProcessorModelIdentifier |
Processor Model |
Categorical |
Replace null values with Census_MDC2FormFactor's mode |
|
Census_ProcessorClass |
A classification of processors into high/medium/low. Initially used for Pricing Level SKU. No longer maintained and updated |
Categorical |
Removed 99% of its values are the same |
|
Census_PrimaryDiskTotalCapacity |
Amount of disk space on primary disk of the machine in MB |
Continous |
Replace null values with Census_MDC2FormFactor's median |
|
Census_PrimaryDiskTypeName |
Friendly name of Primary Disk Type HDD or SSD |
Categorical |
Replace null values with Census_MDC2FormFactor's mode and replace Unspecified with Other |
|
Census_SystemVolumeTotalCapacity |
The size of the partition that the System volume is installed on in MB |
Continous |
Replace null values with Census_MDC2FormFactor's median |
|
Census_HasOpticalDiskDrive |
True indicates that the machine has an optical disk drive (CD/DVD) |
Boolean |
||
Census_TotalPhysicalRAM |
Retrieves the physical RAM in MB |
Continous |
Replace null values with Census_MDC2FormFactor's median |
|
Census_ChassisTypeName |
Retrieves a numeric representation of what type of chassis the machine has. A value of 0 means xx |
Categorical |
Replace null values with Census_MDC2FormFactor's mode and Updated all unknown or numerical values to Other |
|
Census_InternalPrimaryDiagonalDisplaySizeInInches |
Retrieves the physical diagonal length in inches of the primary display |
Continous |
Replace null values with Census_MDC2FormFactor's median |
|
Census_InternalPrimaryDisplayResolutionHorizontal |
Retrieves the number of pixels in the horizontal direction of the internal display. |
Continous |
Replace null values with Census_MDC2FormFactor's median |
|
Census_InternalPrimaryDisplayResolutionVertical |
Retrieves the number of pixels in the vertical direction of the internal display |
Continous |
Replace null values with Census_MDC2FormFactor's median |
|
Census_PowerPlatformRoleName |
Indicates the OEM preferred power management profile. This value helps identify the basic form factor of the device |
Categorical |
Replace any null and Unknown values with Other |
|
Census_InternalBatteryType |
Tye of Battert |
Categorical |
Removed 70% missing data and unable to classify values by Census_MDC2FormFactor |
|
Census_InternalBatteryNumberOfCharges |
Number of Charges on Battery |
Continous |
Replace null values with Census_MDC2FormFactor's median |
|
Census_OSVersion |
Numeric OS version Example 10.0.10130.0 |
Categorical |
||
Census_OSArchitecture |
Architecture on which the OS is based. Derived from OSVersionFull. Example amd64 |
Categorical |
||
Census_OSBranch |
Branch of the OS extracted from the OsVersionFull. Example OsBranch = fbl_partner_eeap where OsVersion = 6.4.9813.0.amd64fre.fbl_partner_eeap.140810-0005 |
Categorical |
||
Census_OSBuildNumber |
OS Build number extracted from the OsVersionFull. Example OsBuildNumber = 10512 or 10240 |
Categorical |
||
Census_OSBuildRevision |
OS Build revision extracted from the OsVersionFull. Example OsBuildRevision = 1000 or 16458 |
Categorical |
||
Census_OSEdition |
Edition of the current OS. Sourced from HKLM\Software\Microsoft\Windows NT\CurrentVersion@EditionID in registry. Example: Enterprise |
Categorical |
||
Census_OSSkuName |
OS edition friendly name (currently Windows only) |
Categorical |
||
Census_OSInstallTypeName |
Friendly description of what install was used on the machine i.e. clean |
Categorical |
||
Census_OSInstallLanguageIdentifier |
Language of Installer |
Categorical |
Replace null values with Unknown, unable to classify mode |
|
Census_OSUILocaleIdentifier |
Locale of Installer |
Categorical |
||
Census_OSWUAutoUpdateOptionsName |
Friendly name of the WindowsUpdate auto-update settings on the machine. |
Categorical |
||
Census_IsPortableOperatingSystem |
Indicates whether OS is booted up and running via Windows-To-Go on a USB stick. |
Boolean |
Removed 99% of its values are the same. |
|
Census_GenuineStateName |
Friendly name of OSGenuineStateID. 0 = Genuine |
Boolean |
||
Census_ActivationChannel |
Retail license key or Volume license key for a machine. |
Categorical |
||
Census_IsFlightingInternal |
Is Driver Flighting Enabled |
Boolean |
Updated null values to 0 |
|
Census_IsFlightsDisabled |
Indicates if the machine is participating in flighting. |
Boolean |
Removed 98% of its values are the same \ missing |
|
Census_FlightRing |
The ring that the device user would like to receive flights for. This might be different from the ring of the OS which is currently installed if the user changes the ring after getting a flight from a different ring. |
Categorical |
||
Census_ThresholdOptIn |
Driver Device Threshold Enabled |
Boolean |
Removed 99.99% of its values are the same |
|
Census_FirmwareManufacturerIdentifier |
Manufcator of Firemware |
Categorical |
Replace null values with Census_MDC2FormFactor's mode |
|
Census_FirmwareVersionIdentifier |
Version of Firemware |
Categorical |
Replace null values with Census_MDC2FormFactor's mode |
|
Census_IsSecureBootEnabled |
Indicates if Secure Boot mode is enabled. |
Boolean |
||
Census_IsWIMBootEnabled |
Is image file boot enabled |
Boolean |
Removed 100% of its values are the same |
|
Census_IsVirtualDevice |
Identifies a Virtual Machine (machine learning model) |
Boolean |
Removed 99% of its values are the same. |
|
Census_IsTouchEnabled |
Is this a touch device ? |
Boolean |
||
Census_IsPenCapable |
Is the device capable of pen input ? |
Boolean |
||
Census_IsAlwaysOnAlwaysConnectedCapable |
Retreives information about whether the battery enables the device to be AlwaysOnAlwaysConnected. |
Boolean |
||
Wdft_IsGamer |
Indicates whether the device is a gamer device or not based on its hardware combination. |
Boolean |
||
Wdft_RegionIdentifier |
Region Code |
Categorical |
Replaced any null values with its mode 10 |
In order to understand the quality of the data, we did an extensive review of the data and determined which fields should be considered categorical, continous, or boolean. We converted 23 id and category fields to object as well as removed outliers to make values boolean.
In order to help provide easier development, we created 4 array of column names: cols_booleans, cols_numerical, cols_categorical, and cols_categorical_large. The values with over 100 possible values were moved to its own bucket, cols_categorical_large, requiring special care to review in order to determine if we could cluster values together.
The remaining code are steps we took to clean our data.
# Load Column groups to help with data wrangling
%run -i ColumnArrays.py
print( "cols_booleans" )
display( cols_booleans )
print( "cols_numerical" )
display( cols_numerical )
print( "cols_categorical" )
display( cols_categorical )
print( "cols_categorical_large" )
display( cols_categorical_large )
#Convert features to right data type
final[cols_categorical] = final[cols_categorical].astype(object)
final[cols_categorical_large] = final[cols_categorical_large].astype(object)
The following section outlines the percentage of missing data and fields with frequency of same values. A benefit of reviewing the percentage of features with the same value is to see which fields we can easily provide median or mode values and others that may be hard to provide a good value for.
The top 3 fields, PUAMode, Census_ProcessorClass, and DefaultBrowsersIdentifier had over 93% of values missing. These values were exluded from our report due to lack of data in order to predict our hypotheses.
#create new summary table
temp_data = []
for col in final.columns:
temp_data.append((col, final[col].isnull().sum() * 100 / final.shape[0],
final[col].value_counts(normalize=True, dropna=False).values[0] * 100,
final[col].dtype))
temp_data = pd.DataFrame(temp_data, columns=['Attribute','Missing_values [%]',
'Higher_frequency_data [%]','Data_type'])
#sort by Freqency values
display( temp_data.loc[temp_data["Missing_values [%]"] > 0,:]
.sort_values(by =[temp_data.columns[1]], ascending=False) )
#sort by Freqency values
display( temp_data.loc[temp_data["Higher_frequency_data [%]"] > 80,:]
.sort_values(by =[temp_data.columns[2]], ascending=False) )
The dataset was analyzed and three key classifications of data were utilized:
Categorical Variables: The process followed for categorical was to identify its mode for each feature to replace with. However, to try and prevent as much bias in our data, we would factor in another feature to help classify our features. For example, OsBuildLab missing values were derived from OSBuild and Processor.
Continous Variables: All the continous variables' values were derived from its median by grouping Census_MDC2FormFactor. We were able to classify how much RAM a machine has based on if it was a Tablet or a Server.
Boolean Variables: Any values missing or we felt was an outlier was imputed witht he value of 0.
# Smart Screen fill miising values and fix characters issue
final.SmartScreen.fillna('ExistsNotSet', inplace=True)
final.SmartScreen.replace({"off":"Off","00000000":"ExistsNotSet","" :"ExistsNotSet",
"" :"ExistsNotSet"},inplace=True)
# currently renamed "Census_PrimaryDiskTypeName" unknown data into one category
final.Census_PrimaryDiskTypeName.replace({"Unspecified":"Other"},inplace=True)
# currently renamed "Census_ChassisTypeName" unknown data into one category
final.Census_ChassisTypeName.replace({"UNKNOWN":"Other","Unknown":"Other","0" :"Other",
"30" :"Other",
"35" :"Other",
"112" :"Other",
"76" :"Other",
"39" :"Other"},inplace=True)
# currently renamed "Census_PowerPlatformRoleName" unknown data into one category
final.Census_PowerPlatformRoleName.fillna('Other', inplace=True)
final.Census_PowerPlatformRoleName.replace({"UNKNOWN":"Other"},inplace=True)
#IsFlightsDisabled and IsFlightingInternal
final.Census_IsFlightingInternal.fillna("1", inplace=True)
final.Census_IsFlightsDisabled.fillna("1", inplace=True)
#Wdft_RegionIdentifier
final.Wdft_RegionIdentifier.astype(float,inplace=True)
final.Wdft_RegionIdentifier.fillna("10",inplace=True)
final.Wdft_RegionIdentifier.astype(object,inplace=True)
#OrganizationIdentifier
final.OrganizationIdentifier.fillna("0",inplace=True)
#RtpStateBitfield
final.RtpStateBitfield.fillna("-1",inplace=True)
#CityIdentifier
final.CityIdentifier.astype(float,inplace=True)
final.CityIdentifier.fillna("92213",inplace=True)
final.CityIdentifier.astype(object,inplace=True)
#Census_OSInstallLanguageIdentifier unable to classify the value
final.Census_OSInstallLanguageIdentifier.fillna("Unknown",inplace=True)
#AVProductsInstalled 1 was the highest mode
final.AVProductsInstalled.astype(float, inplace=True)
final.AVProductsInstalled.fillna(1,inplace=True)
final.AVProductsInstalled.astype(object, inplace=True)
#AVProductsEnabled default to 0 if no value
final.AVProductsEnabled.astype(float,inplace=True)
final.AVProductsEnabled.fillna(0,inplace=True)
#AVProductStatesIdentifier default to 53447 highest mode
final.AVProductStatesIdentifier.astype(float, inplace=True)
final.AVProductStatesIdentifier.fillna(53447,inplace=True)
final.AVProductStatesIdentifier.astype(object, inplace=True)
#GeoNameIdentifier167.0
final.GeoNameIdentifier.fillna("167",inplace=True)
#OsBuildLab match by OSBuild and Processor
final.OsBuildLab.fillna("17134.1.amd64fre.rs4_release.180410-1804",inplace=True)
#IeVerIdentifier default -1, unable to classify
final.IeVerIdentifier.fillna("-1",inplace=True)
#Defautl all boolean nulls to 0
final.IsProtected.fillna(0,inplace=True)
final.Firewall.fillna(0,inplace=True)
final.UacLuaenable.fillna(0,inplace=True)
final.Census_ThresholdOptIn.fillna(0,inplace=True)
final.Census_IsWIMBootEnabled.fillna(0,inplace=True)
final.Census_IsVirtualDevice.fillna(0,inplace=True)
final.Census_IsAlwaysOnAlwaysConnectedCapable.fillna(0,inplace=True)
final.Wdft_IsGamer.fillna(0,inplace=True)
cols = np.append(["Census_MDC2FormFactor"], cols_numerical)
df_grouped = final.groupby("Census_MDC2FormFactor").agg({
"Census_ProcessorCoreCount":"median",
"Census_PrimaryDiskTotalCapacity":"median",
"Census_SystemVolumeTotalCapacity":"median",
"Census_TotalPhysicalRAM":"median",
"Census_InternalPrimaryDiagonalDisplaySizeInInches":"median",
"Census_InternalPrimaryDisplayResolutionHorizontal":"median",
"Census_InternalPrimaryDisplayResolutionVertical":"median",
"Census_InternalBatteryNumberOfCharges":"median"
})
# Find the mode instead of median
cols = [
"Census_FirmwareManufacturerIdentifier",
"Census_FirmwareVersionIdentifier",
"Census_OEMNameIdentifier",
"Census_OEMModelIdentifier",
# "Census_ProcessorModelIdentifier",
"Census_ProcessorManufacturerIdentifier",
"Census_PrimaryDiskTypeName",
"Census_ChassisTypeName"
]
for col in cols:
imputes = final.groupby(["Census_MDC2FormFactor",col]).agg({
"MachineIdentifier":"count"})
imputes.reset_index(level=[col], inplace=True)
# Max count to identify the right firmware
idx = imputes.groupby(["Census_MDC2FormFactor"])['MachineIdentifier'].transform(max) == imputes['MachineIdentifier']
imputes.drop(["MachineIdentifier"], axis=1, inplace=True)
# Merge df_grouped
df_grouped = pd.merge(df_grouped,imputes[idx], left_index = True, right_index = True, how="inner")
# Census_ProcessorModelIdentifier
# Accounting for categories with same counts. Picking the first one
imputes = final.groupby(["Census_MDC2FormFactor","Census_ProcessorModelIdentifier"]).agg({
"MachineIdentifier":"count"})
imputes.reset_index(level=["Census_MDC2FormFactor","Census_ProcessorModelIdentifier"], inplace=True)
# Max count to identify the right firmware
idx = imputes.groupby(["Census_MDC2FormFactor"])['MachineIdentifier'].transform(max) == imputes['MachineIdentifier']
# Merge df_grouped
imputes = imputes[idx]
imputes['row_id'] = imputes.groupby(['Census_MDC2FormFactor']).cumcount()+1
imputes = imputes.loc[imputes["row_id"]==1,]
imputes.set_index('Census_MDC2FormFactor',inplace=True)
imputes.drop(["MachineIdentifier","row_id"], axis=1, inplace=True)
df_grouped = pd.merge(df_grouped,imputes, left_index = True, right_index = True, how="inner")
for name, group in df_grouped.items():
for key, value in group.items():
mask = final["Census_MDC2FormFactor"]==key
# print( name + "." + key + " Value: " + str(value) )
# print( "Updating: " + str(final.loc[final[name].isnull() & mask, name].shape) )
final.loc[final[name].isnull() & mask, name] = value
final[name].replace(-1,value,inplace=True)
# Save Clean File
# final.to_csv("data/clean.final.csv")
The features below is represents the hardware configurations on a machine. Reviewing each data set, it appears majority of the machines in our data are personal desktops. They are a typical desktop with 4 cores, 4 gigabytes of memory, and standard resolution on the monitor. The majority are desktops since majority has no battery charges.
This data gives us a better understanding what our population of users could be. One of the hardest aspect of this dataset is not knowing what type of user would be suspectable to Malware. It appears these machines are individuals vs. server or custom corproate builds.
final[cols_numerical].describe()
There is also some evidence that machines with higher performance may be more vulnerable to malware.
cols = np.append(["HasDetections"],cols_numerical)
final[cols].groupby(by='HasDetections').mean()
To better understand its higher mean, we perform visualization for different hardware configurations between machines having a detection or not. The box plots were view on the log scale, with the exception of Census_ProcessorCoreCount.
Visually, only Census_PrimaryDiskTotalCapacity shows a difference of its mean and median between machines being attacked.
#Most of our data has processor count <10
final.boxplot(column='Census_ProcessorCoreCount', by = 'HasDetections')
plt.suptitle("")
ax = final.boxplot(
column='Census_PrimaryDiskTotalCapacity',
by = 'HasDetections')
ax.set_yscale('log')
plt.suptitle("")
ax = final.boxplot(
column='Census_SystemVolumeTotalCapacity',
by = 'HasDetections')
ax.set_yscale('log')
plt.suptitle("")
ax = final.boxplot(
column='Census_TotalPhysicalRAM',
by = 'HasDetections')
ax.set_yscale('log')
plt.suptitle("")
Below are the mode statistics for our categorical variables. From our review of the different categories, we found that the values are heavily weighted to 1 or 2 values. We believe this is because there are many different scenarios that make a machine vulnerable. We hope through unsupervised clustering on categorical features, we can learn more about how the data explains the store of detections.
for col in cols_categorical:
display(final.groupby(col).agg({'MachineIdentifier':"count"}) )
In the analysis, AvSigVersion shows the majority of machines using the latest 2 of the 3 builds (1.275 & 1.273). Domain knowledge would tell us that new definition set would have the latest detections for malware, but data tells us otherwise.
# Extrac 1.XX Release values
AvSigVersion_split = final["AvSigVersion"].str.rsplit(pat=".",expand=True)
final["AvSigVersion_1_x"] = AvSigVersion_split.loc[:,0]+"."+AvSigVersion_split.loc[:,1]
final["AvSigVersion_1_x_xx"] = AvSigVersion_split.loc[:,0]+"."+AvSigVersion_split.loc[:,1]+"."+AvSigVersion_split.loc[:,2].str[:2]
av_sig = pd.crosstab(
[
final["AvSigVersion_1_x"]
],
final.HasDetections.astype(bool)
)
av_sig.plot(kind='barh' , figsize = (10,30))
A more detailed breakdown of AvSigVersion_1_x of 1.273 & 2.75 shows that users are not upgrading their builds thus increasing their risk of an attack.
filted = final.AvSigVersion_1_x.isin(['1.275','1.273'])
# filtered = final.AvSigVersion_1_x == "" | final.AvSigVersion_1_x == ""
av_sig = pd.crosstab(
[
final.loc[filted,"AvSigVersion_1_x_xx"]
],
final.HasDetections.astype(bool)
)
av_sig.plot(kind='barh',figsize = (10,30), )
The below plot for shows that AVProductsInstalled cateogory value 1 and 2 have different success rate for attacks on the machine. We do believe this data is ordinal and related to the level of support the AV has or another competetive product. However, Microsoft did not release this information.
av_enabled = pd.crosstab(
[
final["AVProductsInstalled"]
],
final.HasDetections.astype(bool)
)
av_enabled.plot(kind='bar')# , figsize = (10,30))
Smart screen is a feature offered on Defender that helps block malware through your web browser. The data provided had 6 categories, however the Defender application only shows "Block", "Warn", or "Off". We plotted the data again showing the percentage of machines enabling Smart Screen who had dections on their machine for malware.
The chance of a succesful comromise with Malware in Enable mode decreases by about 50% as compared to without SmartScreen enabled.
snart_screen = pd.crosstab(
[
final["SmartScreen"]
],
final.HasDetections.astype(bool)
)
snart_screen.plot(kind='bar')# , figsize = (10,30))
final_save=final.copy()
df_ss=final_save[['SmartScreen','HasDetections']].copy()
df_ss.SmartScreen.replace({"Block":"Enable","ExistsNotSet":"Disable","Off" :"Disable",
"Prompt" :"Enable","RequireAdmin":"Enable","Warn":"Enable"},inplace=True)
df_ss.SmartScreen.value_counts()
df_grouped_ss=df_ss.groupby(by=['SmartScreen'])
detection_rate_ss=df_grouped_ss.HasDetections.sum()/df_grouped_ss.HasDetections.count()
ss = pd.crosstab(
[df_ss["SmartScreen"].astype(object)],
df_ss.HasDetections.astype(bool))
ss_rate=ss.div(ss.sum(1).astype(float),axis=0)
ss_rate.plot(kind='barh',stacked=True)
The original submission for Explore Joint Attributes are replaced with the following updates.
Hardware configuration is investigated for joint attributes exploration. Correlation among various hardware configurations are analyzed.
Interesting relationship is also found between machine's FormFactor to ProcessorCoreCount, PrimaryDiskTotalCapacity, SystemVolumeTotalCapacity, TotalPhysicalRAM, InternalPrimaryDiagonalDisplaySizeInInches, and TotalResolution. By focusing of the median values of the mentioned Hardware configuration, classification can be made for machine Form Factor.
#from pandas.plotting import scatter_matrix
# scatter_matrix(final[cols_numerical], alpha=0.2, figsize=(30,30), diagonal='kde')
import seaborn as sns; sns.set(style="ticks", color_codes=True)
# df = (final[cols_numerical]-final[cols_numerical].mean())/(final[cols_numerical].std())
# display(scl_numericals.describe())
df = final[np.append("Census_MDC2FormFactor",cols_numerical)].copy()
# df["Census_MDC2FormFactor"] = final["Census_MDC2FormFactor"]
# # Apply Log Transformation
df["Census_SystemVolumeTotalCapacity"] = np.log(df.Census_SystemVolumeTotalCapacity)
df["Census_TotalPhysicalRAM"] = np.log(df.Census_TotalPhysicalRAM)
df["Census_PrimaryDiskTotalCapacity"] = np.log(df.Census_PrimaryDiskTotalCapacity)
# Simplify columns names for easier visual analysis
df.columns = [col.split("_",1)[1] for col in df.columns]
df.rename(index=str, columns={"InternalPrimaryDisplayResolutionHorizontal": "H_Resolution",
"InternalPrimaryDisplayResolutionVertical": "V_Resolution",
"InternalPrimaryDiagonalDisplaySizeInInches": "DisplaySize"}, inplace=True)
Below scatter plot shows the relationship among all numerical features which are basically consists of Machine configuration: Processor Count, Primary Disk Total Capacity, System Volume Total Capacity, Total Physical RAM, InternalPrimaryDiagonalDisplaySizeInInches, InternalPrimaryDisplayResolutionHorizontal, and InternalPrimaryDisplayResolutionVertical.
The resolution features InternalPrimaryDisplayResolutionHorizontal and InternalPrimaryDisplayResolutionVertical is highly correlated. In order to adjust for its correlation, we will create a new variable TotalResolution by multiplying the 2 values.
There is also correlation between SystemVolumeTotalCapacity and PrimaryDiskTotalCapacity. There also appears to be an outlier where Primary Volume is Less then System Volume. Since both Disk Capacity values are correlated, we will select PrimaryDiskTotalCapacity during our modeling.
As expected as well as a relationship between Census_ProcessorCoreCount and Census_TotalPhysicalRAM. The majority of the data is exteremly right skewed therefore the data needed to be log transformed. Addtionaly, the visual showing of "zebra" stripes indicate this data may be better suited as a categorical variable.
sns.set(font_scale=1.05)
sns.pairplot(df)
plt.show()
#Creating new Feature "TotalDisplayResolution"
df["TotalDisplayResolution"] = df.H_Resolution * df.V_Resolution
del df["H_Resolution"]
del df["V_Resolution"]
del df["SystemVolumeTotalCapacity"]
By focusing of the median values of the mentioned Hardware configuration, classification can be made for machine Form Factor.
df_grouped = df.groupby("MDC2FormFactor").agg({
"ProcessorCoreCount":"median",
"PrimaryDiskTotalCapacity":"median",
"TotalPhysicalRAM":"median",
"DisplaySize":"median",
"TotalDisplayResolution":"median"
})
df_grouped
Below is a heatmap of the Hardware configuration. We are able to confirm our relationship with Processor Cores and Physical memory with correlation of >0.6 We will chose to use both values as they explain different things on a machine. Other than that, there is no additional features that highly correlated.
# Generate a mask for the upper triangle
sns.set(font_scale=1)
mask = np.zeros_like(df.corr(), dtype=np.bool)
mask[np.triu_indices_from(mask)] = True
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
ax = sns.heatmap(df.corr(), mask=mask, cmap=cmap, vmax=1, center=0,
square=False, linewidths=.5, cbar_kws={"shrink": .5},annot=True)
There are several fields of AVInstalled and AVEnabled that are cateogirized as 0,1,2, etc. but we are not sure what this data means. In order to understand these fields we created a heatmap between different features (excluding builds) to determine if there could be a relationship. We separate the heatmap for positive correlation and negative correlation for easy viewing.
The heatmap below shows possitive correlation of ~0.6 for FlightRing_NOT_SET to ActivationChannel_Volume and DownloadNotify to ActivationChannel_Retail.
# Create Dummy Variables
RtpStateBitfield = pd.get_dummies(final["RtpStateBitfield"],prefix="RtpStateBitfield")
AVProductsInstalled = pd.get_dummies(final["AVProductsInstalled"],prefix="AVProductsInstalled")
AVProductsEnabled = pd.get_dummies(final["AVProductsEnabled"],prefix="AVProductsEnabled")
SmartScreen = pd.get_dummies(final["SmartScreen"],prefix="SmartScreen")
Census_FlightRing = pd.get_dummies(final["Census_FlightRing"],prefix="Census_FlightRing")
Census_OSWUAutoUpdateOptionsName = pd.get_dummies(final["Census_OSWUAutoUpdateOptionsName"],prefix="Census_OSWUAutoUpdateOptionsName")
Census_GenuineStateName = pd.get_dummies(final["Census_GenuineStateName"],prefix="Census_GenuineStateName")
Census_ActivationChannel = pd.get_dummies(final["Census_ActivationChannel"],prefix="Census_ActivationChannel")
df = pd.concat(
(
final[["Firewall","IsProtected","HasDetections"]],
RtpStateBitfield,
AVProductsInstalled,
AVProductsEnabled,
SmartScreen,
Census_FlightRing,
Census_OSWUAutoUpdateOptionsName,
Census_GenuineStateName,
Census_ActivationChannel
), axis=1
)
sns.set(font_scale=3)
fig, ax = plt.subplots(figsize=(40,40))
sns.heatmap(df.corr(), ax=ax, vmin=0.4)
plt.show()
The heatmap below shows negative correlation of >0.6 for features in the same category. This is expected and we can use them for one hot encoding for futures below:
sns.set(font_scale=3)
fig, ax = plt.subplots(figsize=(40,40))
sns.heatmap(df.corr(), ax=ax, vmax=-0.4)
plt.show()
The below heatmap shows us that EngineVersion is positively correlated with AVSigVersion. Through our reserch, we learned that AVSigVersion has the latest versions of malware definitions. During our building of the model, we will favor AVSigVersion instead of EngineVersion.
AppVersion_split = final["AppVersion"].str.rsplit(pat=".",expand=True)
final["AppVersion_x_x"] = AppVersion_split.loc[:,0]+"."+AppVersion_split.loc[:,1]#+"."+AppVersion_split.loc[:,2]
AvSigVersion_split = final["AvSigVersion"].str.rsplit(pat=".",expand=True)
final["AvSigVersion_x_x"] = AvSigVersion_split.loc[:,0]+"."+AvSigVersion_split.loc[:,1]
EngineVersion = pd.get_dummies(final["EngineVersion"],prefix="EngineVersion")
AppVersion_x_x = pd.get_dummies(final["AppVersion_x_x"],prefix="AppVersion_x_x")
AvSigVersion_x_x = pd.get_dummies(final["AvSigVersion_x_x"],prefix="AvSigVersion_x_x")
df = pd.concat(
(
EngineVersion,
AvSigVersion_x_x,
AppVersion_x_x
), axis=1
)
sns.set(font_scale=3)
fig, ax = plt.subplots(figsize=(40,40))
sns.heatmap(df.corr(), ax=ax, vmin=0.4)
plt.show()
Negative heatmap below shows negative correlation between AvSigVersion 1.273 and 1.277. This shows that a system that is upgraded to AvSigVersion1.277 is normally has the Revision of 1.273. Also, a similar negative correlation is seen for EngineVersion 1.1.15200.1 and 1.1.15300.6. From the previous plot, we know that is expected given the high positive correlation between AvSigVersion to EngineVersion.
sns.set(font_scale=3)
fig, ax = plt.subplots(figsize=(40,40))
sns.heatmap(df.corr(), ax=ax, vmax=-0.4)
plt.show()
Further details from the boxplots, the density of Detections for Diskspace and Memory was analyzed.. In the violin plot for Census_TotalPhysicalRAM . We can visually see that Median server, as the memory increases there is a higher likely hood of vulnerability. The remaining servers, with the exception of Large (low detections), memory is not a high factor between detections. The feature Census_PrimaryDiskTotalCapacity also shows a similar pattern, where if you are a server with more diskspace you are more likely to be attacked by malware.
cmap = sns.diverging_palette(220, 10, as_cmap=True)
f, ax = plt.subplots(figsize=(20,30))
final["Census_TotalPhysicalRAM_gb"] = final.Census_TotalPhysicalRAM / 1000
sns.violinplot(x="Census_MDC2FormFactor", y="Census_TotalPhysicalRAM_gb", hue="HasDetections", data=final,
split=True, inner="quart")
cmap = sns.diverging_palette(220, 10, as_cmap=True)
f, ax = plt.subplots(figsize=(20,30))
sns.violinplot(x="Census_MDC2FormFactor", y="Census_PrimaryDiskTotalCapacity", hue="HasDetections", data=final,
split=True, inner="quart")
While researching different builds of Defender, we learned there is the version of the actual Application, AppVersion, along with definition version to detect Malware, AvSigVersion, and Engine Version EngineVersion. In order to understand its relationship between detections, we compared combinations of different versions. Visually the 2 largest groups of version combinations has the make chance for an attack. This may be an indication that attackers know what the adoption rate for versions and targeting these versions.
# get only the 4.X values
AppVersion_split = final["AppVersion"].str.rsplit(pat=".",expand=True)
final["AppVersion_reduce"] = AppVersion_split.loc[:,0]+"."+AppVersion_split.loc[:,1]#+"."+AppVersion_split.loc[:,2]
defenders = pd.crosstab(
[
final["EngineVersion"].fillna(1),
final["AppVersion_reduce"].fillna(1),
final["AvSigVersion_1_x"]
],
final.HasDetections.astype(bool)
)
defenders.plot(kind='barh',stacked=False, figsize = (10,70))
Through domain knowledge we know firewall is important to prevent attacks. In order to visualize if a Firewall is an important feature, we grouped the variable Firewall with AVProductsEnabled to determine its effectiveness. The rate for each group is very similar. This is because user behavior indicates more attacks are done by downloads versus straight attacks on the operating system.
final['AVProductsEnabled'] = final['AVProductsEnabled'].astype(int)
final['av_status']=pd.cut(final.AVProductsEnabled,[-1,0,final.AVProductsEnabled.max()],2,labels=['NoAV','AV'])
final['fw_status']=pd.cut(final.Firewall,[-1,0,final.Firewall.max()],2,labels=['NoFW','FW'])
final['Def_status'] = final[['av_status', 'fw_status']].apply(lambda x: '_'.join(x), axis=1)
df_avfw=final[['Def_status','HasDetections']].copy()
df_grouped_avfw=df_avfw.groupby(by=['Def_status'])
detection_rate=df_grouped_avfw.HasDetections.sum()/df_grouped_avfw.HasDetections.count()
detection_rate.plot(kind='bar')
One of the key data points that is absent in this dataset is the actual risk that an attacker would compromise the machine based on known vulnerabilities because simply taking a data set without understanding the vulnerabilities related to exposure, does not tell the entire story. Without this data, the machine learning models will miss critical attributes of information in determining if malware could compromise the host. While the Microsoft competition purpose is to determine if machines can be compromised, it is vital to understand the actual risk of the machine to determine if an attacker can actually deploy malware on a machine. From an attacker standpoint, the methodology that is used to compromise a machine is to research whether a machine has vulnerabilities with tools like nMap or Nessus so they can launch the attack. Attackers cannot deploy malware without exploiting vulnerabilities and only understanding the operating system levels, anti-virus dat versions, operating system configuration has no real context from an attacker standpoint to determine whether a machine can be compromised.
The first data element that is missing is the Common Vulnerabilities and Exposures (CVE) data to determine if the machine is vulnerable. In order to gather this data, the OS build field will need to be compared to the Microsoft version number. For example the Build number of 17763.253 matches to Version 1809 (e.g.Microsoft Build to Version Conversion). The second step is to find all the CVE's related with that version of Operating System. For example, version 1809 has 73+ related CVEs. (e.g.Microsoft Version to Listing of CVE Vulnerabilites). Once this data is obtained, the data can be aggregated by severity based on the CVE Score with the highest exploitable vulnerability as the rating.
# Include new features
ms_versions = pd.read_csv("data/vulnerabilities.csv")
final["os_build"] = final.Census_OSVersion.str.slice(5,10).astype(int)
final = pd.merge(final, ms_versions, on='os_build', how='left')
ms_versions[cols_cve].drop_duplicates()
The second data point that is missing is related to the Antivirus program. The "EngineVersion", "AppVersion", and "AVSigVersion" all need to be transformed related to how far they are out of date compared to the current date. While attackers use vulnerabilities to deploy malware, is a system is out of date, the machine will have a higher degree of susceptibility.
The last data element that is missing is related to the firewall. While under most circumstances’ firewalls are a good way at protecting a machine, the exploitability factors are not related to whether the firewall, inbound or outbound, is on or off, but the determination of what ports or services are open on the firewall external to the machine. However, this data cannot be captured through an external data source because this is something local to the machine that would have had to be captured as part of the original dataset.
SUCCESS MEASURE You have free reign to provide additional analyses. One idea: implement dimensionality reduction, then visualize and interpret the results.
For the exceptional work bonus points, we believe that the subject matter expertise leveraged in the business understand in this report. We performed additional research to actually understand the problem that Microsoft is facing in this Kaggle competition and presented the challenges of the cybersecurity maket and attack landscape in context to the entire industry and the problem that Microsoft is trying to solve.
The second piece of exceptional work in this report is related to the additional features that need to be captured to understand whether or not a machine is vulnerability from a security perspective. Three key features are missing which include total vulnerabilities, Antivirus program aging, and open firewall ports and services. All of these data elements would tremendously help aid in the predication of whether the machine could be compromised with malware. As part of our EDA, the team did bring in the data set that compared the OS and Patch version to vulnerabilities. While this data is important for this stage in the project, we fully expect that this will help with our prediction in the future classes
The thrid piece of exception work was data mining and munging the existing dataset. Since the data set contain over 300,000 rows and 82 variables, we spent a lot of time cleaning the data. We were able to delete columumns that had over 95% invalid data and munged the data in the columns that were missing data and inserted either median or mode data depending on whether is was numberical or catagorical.